Brian Weinfeld
May 5, 2018
Introduction
What is the relationship between the top-grossing blockbusters of the last decade and the gender of each movie’s main stars?
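library(tidyverse)  # purrr, dplyr, stringr, tidyr, readr, tibble
library(RCurl)      # getURL
library(XML)        # htmlParse, xpathSApply, xmlValue
library(httr)       # GET, add_headers, content
library(jsonlite)   # fromJSON

# Scrape the yearly Box Office Mojo chart and keep the top `rank` titles per year.
Top.Movie.Query <- function(years, rank){
  years %>%
    map_df(~getURL(paste0('http://www.boxofficemojo.com/yearly/chart/?yr=', .x, '&p=.htm')) %>%
      htmlParse() %>%
      xpathSApply('//*[@id="body"]/table[3]//tr//td', xmlValue) %>%
      .[15:914] %>%                      # 900 cells, i.e. 100 rows of 9 columns
      matrix(ncol=9, byrow=TRUE) %>%
      as.data.frame() %>%
      filter(row_number() <= rank) %>%
      # Strip a trailing " (YEAR)" suffix from the title column
      mutate(Movie=str_replace(V2, paste0('(.*?)( \\(', .x, '\\))$'), '\\1'),
             Year = .x) %>%
      select(Movie, Year)
    )
}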
top.movies <- Top.Movie.Query(2017:2008, 50)
all.movies <- map2_df(top.movies$Movie, top.movies$Year, ~Movie.API.Query(.x, .y))

The first function, Top.Movie.Query, scrapes boxofficemojo.com for the names of the top 50 domestic-grossing blockbusters of each year from 2008 to 2017.
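The title cleanup inside Top.Movie.Query can be tried offline. The sketch below rebuilds the same pattern but uses base R's sub in place of stringr's str_replace so that it runs without extra packages. The helper name strip_year_suffix and the sample titles are illustrative assumptions, not data from the scrape.

```r
# Hypothetical helper: remove a trailing " (YEAR)" suffix from a title,
# using the same pattern that Top.Movie.Query builds with paste0.
strip_year_suffix <- function(title, year) {
  pattern <- paste0("(.*?)( \\(", year, "\\))$")
  sub(pattern, "\\1", title, perl = TRUE)
}

# Made-up titles for illustration only.
titles <- c("Example Movie (2010)", "Example Movie", "Example (2009) Movie (2010)")
cleaned <- strip_year_suffix(titles, 2010)
print(cleaned)
```

Only a suffix that matches the chart year and sits at the very end of the title is removed, so a year that appears elsewhere in the title is left alone.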
# Look up one movie on OMDb. `apikey` is expected to exist in the global environment.
Movie.API.Query <- function(movie, year){
  print(movie)
  initial.query <- GET('http://www.omdbapi.com/',
                       add_headers('Content-Type'='application/json', 'Accept-Encoding'='gzip'),
                       query=list('t'=movie, 'apikey'=apikey, 'y'=year, 'plot'='full')
    ) %>%
    content(as='text') %>%
    fromJSON(flatten=FALSE) %>%
    .[-15] %>%                # drop the 15th element of the parsed response
    as_tibble()
  if(ncol(initial.query) == 2){
    # A two-column result is treated as a failed lookup
    print('Movie Not Found!')
    tibble(Title=movie, Year=as.character(year))
  }else{
    initial.query %>%
      select(c(1, 2, 3, 5, 6, 9, 10, 14, 15, 18, 21)) %>%
      mutate(Genre = str_extract(Genre, '([^,]+)'),     # first listed genre only
             Runtime = str_extract(Runtime, '(\\d+)'),  # minutes, digits only
             Actors = IMDB.Star.Query(imdbID),          # replace with the two leads from IMDb
             BoxOffice = parse_number(BoxOffice)
      ) %>%
      separate(Actors, c('Lead_1', 'Lead_2'), sep=', ') %>%
      mutate(Lead_1_Male = Wikipedia.Gender.Query(Lead_1),
             Lead_2_Male = Wikipedia.Gender.Query(Lead_2)
      ) %>%
      # Place each gender flag directly after the corresponding lead
      select(c(1:6, 13, 7, 14), everything())
  }
}

The second function, Movie.API.Query, queries the OMDb API for each of the movies. It calls two other functions to fill in missing information, namely the stars of each movie and the genders of those stars.
# Scrape the first two credited actors from a movie's IMDb page.
IMDB.Star.Query <- function(movie.id){
  Sys.sleep(1)  # pause between requests
  getURL(paste0('https://www.imdb.com/title/', movie.id,'/')) %>%
    htmlParse() %>%
    xpathSApply('//*[@id="title-overview-widget"]//span[@itemprop="actors"]//a', xmlValue) %>%
    .[1:2] %>%
    paste(collapse=', ')
}

Wikipedia.Gender.Query <- function(lead){
  Sys.sleep(0.5)  # pause between requests
  lead <- str_replace_all(lead, ' ', '_')
  # Collect the text of the first two paragraphs of the article
  initial.query <- getURL(paste0('https://en.wikipedia.org/wiki/', lead)) %>%
    htmlParse() %>%
    xpathSApply('//*[@id="mw-content-text"]/div/p[position()<3]', xmlValue) %>%
    unlist() %>%
    paste(collapse='')
  # On a disambiguation page, retry with the "_(actor)" suffix
  if(str_detect(initial.query, 'may refer to:')){
    initial.query <- getURL(paste0('https://en.wikipedia.org/wiki/', lead, '_(actor)')) %>%
      htmlParse() %>%
      xpathSApply('//*[@id="mw-content-text"]/div/p[position()<3]', xmlValue) %>%
      unlist() %>%
      paste(collapse='')
  }
  # TRUE = male, FALSE = female, NA = undetermined
  if(str_detect(initial.query, 'actor') && !str_detect(initial.query, 'actress')){
    return(TRUE)
  }else if(str_detect(initial.query, 'actress') && !str_detect(initial.query, 'actor')){
    return(FALSE)
  }else{
    return(NA)
  }
}

IMDB.Star.Query uses the IMDb ID returned by the API to scrape the two main leads of each film. Wikipedia.Gender.Query scrapes Wikipedia in an effort to determine the gender of each star by looking for the words ‘actor’ or ‘actress’ in the opening paragraphs of the star’s article. If a determination cannot be made, because both words or neither word appears, it returns NA. This method had a success rate of over 95%.
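The decision rule at the end of Wikipedia.Gender.Query can be checked without any network access. The following is a minimal sketch that applies the same rule to plain strings, using base R's grepl rather than str_detect so it has no package dependencies. The helper name classify_intro and the sample sentences are made up for illustration.

```r
# Hypothetical helper: apply the actor/actress rule to a block of text.
# Returns TRUE for male, FALSE for female, NA when the text is ambiguous.
classify_intro <- function(intro) {
  has_actor   <- grepl("actor", intro, fixed = TRUE)
  has_actress <- grepl("actress", intro, fixed = TRUE)
  if (has_actor && !has_actress) {
    TRUE
  } else if (has_actress && !has_actor) {
    FALSE
  } else {
    NA
  }
}

# Made-up introductions for illustration only.
r_male    <- classify_intro("John Example is an American actor and producer.")
r_female  <- classify_intro("Jane Example is an English actress.")
r_both    <- classify_intro("Pat Example is an actor whose sister is an actress.")
r_neither <- classify_intro("Sam Example is a comedian.")
print(c(r_male, r_female, r_both, r_neither))
```

The two words can be tested independently because "actor" is not a substring of "actress". The match is case-sensitive, so an introduction that only contains "Actor" with a capital letter would fall through to NA.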
| Title | Year | Rated | Runtime | Genre | Lead_1 | Lead_1_Male | Lead_2 | Lead_2_Male | BoxOffice | Type |
|---|---|---|---|---|---|---|---|---|---|---|
| The Dark Knight | 2008 | PG-13 | 152 | Action | Christian Bale | TRUE | Heath Ledger | TRUE | 533316061 | Male/Male |
| Avatar | 2009 | PG-13 | 162 | Action | Sam Worthington | TRUE | Zoe Saldana | FALSE | 749700000 | Male/Female |
| Marvel’s The Avengers | 2012 | PG-13 | 143 | Action | Robert Downey Jr. | TRUE | Chris Evans | TRUE | 623357910 | Male/Male |
| The Dark Knight Rises | 2012 | PG-13 | 164 | Action | Christian Bale | TRUE | Tom Hardy | TRUE | 448130642 | Male/Male |
| Star Wars: The Force Awakens | 2015 | PG-13 | 136 | Action | Daisy Ridley | FALSE | John Boyega | TRUE | 936658640 | Female/Male |
| Jurassic World | 2015 | PG-13 | 124 | Action | Chris Pratt | TRUE | Bryce Dallas Howard | FALSE | 528757749 | Male/Female |
| Rogue One: A Star Wars Story | 2016 | PG-13 | 133 | Action | Felicity Jones | FALSE | Diego Luna | TRUE | 532171696 | Female/Male |
| Finding Dory | 2016 | PG | 97 | Animation | Ellen DeGeneres | FALSE | Albert Brooks | TRUE | 486292984 | Female/Male |
| Star Wars: The Last Jedi | 2017 | PG-13 | 152 | Action | Daisy Ridley | FALSE | John Boyega | TRUE | 619117636 | Female/Male |
| Beauty and the Beast | 2017 | PG | 129 | Family | Emma Watson | FALSE | Dan Stevens | TRUE | 503974884 | Female/Male |
Initial Analysis
| Type | n |
|---|---|
| Female/Female | 32 |
| Female/Male | 79 |
| Male/Female | 169 |
| Male/Male | 220 |
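The report does not show how the Type column was built or counted. One plausible way, sketched here under that assumption, is to paste together a label for each lead and tabulate the result. The rows below are made up and only reuse the column names from the data table above.

```r
# Made-up rows; only the column names come from the report.
movies <- data.frame(
  Title = c("A", "B", "C", "D", "E"),
  Lead_1_Male = c(TRUE, TRUE, FALSE, FALSE, TRUE),
  Lead_2_Male = c(TRUE, FALSE, TRUE, FALSE, TRUE)
)

# Hypothetical helper: turn the logical flag into a label.
label <- function(is_male) ifelse(is_male, "Male", "Female")

# Combine the two labels in billing order, then count each combination.
movies$Type <- paste(label(movies$Lead_1_Male), label(movies$Lead_2_Male), sep = "/")
type_counts <- table(movies$Type)
print(type_counts)
```

Rows where either gender flag is NA would need to be dropped or corrected by hand before this step; the report does not say which was done.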
| Title | Year | BoxOffice | Type |
|---|---|---|---|
| Frozen | 2013 | $400,736,600 | Female/Female |
| Maleficent | 2014 | $190,871,149 | Female/Female |
| Cinderella | 2015 | $183,327,144 | Female/Female |
| The Help | 2011 | $169,705,587 | Female/Female |
| Hidden Figures | 2016 | $169,385,416 | Female/Female |
| Title | Year | BoxOffice | Type |
|---|---|---|---|
| Star Wars: The Force Awakens | 2015 | $936,658,640 | Female/Male |
| Star Wars: The Last Jedi | 2017 | $619,117,636 | Female/Male |
| Rogue One: A Star Wars Story | 2016 | $532,171,696 | Female/Male |
| Beauty and the Beast | 2017 | $503,974,884 | Female/Male |
| Finding Dory | 2016 | $486,292,984 | Female/Male |
| Title | Year | BoxOffice | Type |
|---|---|---|---|
| Avatar | 2009 | $749,700,000 | Male/Female |
| Jurassic World | 2015 | $528,757,749 | Male/Female |
| Transformers: Revenge of the Fallen | 2009 | $402,076,689 | Male/Female |
| Jumanji: Welcome to the Jungle | 2017 | $393,201,353 | Male/Female |
| Guardians of the Galaxy Vol. 2 | 2017 | $389,804,217 | Male/Female |
| Title | Year | BoxOffice | Type |
|---|---|---|---|
| Marvel’s The Avengers | 2012 | $623,357,910 | Male/Male |
| The Dark Knight | 2008 | $533,316,061 | Male/Male |
| The Dark Knight Rises | 2012 | $448,130,642 | Male/Male |
| Avengers: Age of Ultron | 2015 | $429,113,729 | Male/Male |
| Toy Story 3 | 2010 | $414,984,497 | Male/Male |
Also can put rating breakdown here if needed
Sentiment Analysis
Suggestions
| Type | word | n | tf | idf | tf_idf |
|---|---|---|---|---|---|
| Male/Male | frustration | 1 | 0.0003729 | 1.3862944 | 0.0005169 |
| Male/Male | mystery | 3 | 0.0011186 | 0.2876821 | 0.0003218 |
| Male/Male | prolong | 1 | 0.0003729 | 1.3862944 | 0.0005169 |
| Male/Male | colonel | 4 | 0.0014914 | 0.6931472 | 0.0010338 |
| Male/Male | major | 3 | 0.0011186 | 0.6931472 | 0.0007753 |
| Male/Male | drone | 1 | 0.0003729 | 1.3862944 | 0.0005169 |
| Male/Male | survive | 11 | 0.0041014 | 0.6931472 | 0.0028429 |
| Male/Male | specialist | 2 | 0.0007457 | 1.3862944 | 0.0010338 |
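The numbers in this table are consistent with the usual definition tf_idf = tf × idf, where tf is a word's count divided by the total words in the document and idf is the natural log of the number of documents divided by the number of documents containing the word. The sketch below reproduces four rows under two inferred assumptions that the report does not state: there are 4 documents (one per Type, since the largest idf equals log(4)), and the Male/Male document holds 2,682 words (since 1 / 0.0003729 is about 2,682).

```r
# Inferred, not stated in the report: 4 documents and 2682 words in Male/Male.
n_docs <- 4
total_words <- 2682

# Counts come from the table; docs_with_word is back-solved from the idf
# column (for example, idf = log(4 / 2) for "survive").
words <- data.frame(
  word = c("frustration", "mystery", "colonel", "survive"),
  n = c(1, 3, 4, 11),
  docs_with_word = c(1, 3, 2, 2)
)
words$tf <- words$n / total_words
words$idf <- log(n_docs / words$docs_with_word)
words$tf_idf <- words$tf * words$idf
print(words)
```

A word such as "mystery" that appears in three of the four documents gets a low idf, so it scores below "colonel" even though the two counts are close.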
Amy Adams and Cameron Diaz star in “Action Blockbuster”! Frustrated by her commanding officer’s unwillingness to address an ongoing civil war on a foreign island nation, Major Jennifer Slater (Cameron Diaz) enlists the help of survival specialist Annie (Amy Adams). Together they journey to the secretive island in an effort to end the prolonged conflict. But what they discover there will shake the world to its very core. Can they solve the mystery of the island before Jennifer’s renegade Colonel can nuke the island via drone? You won’t want to miss a moment of “Action Blockbuster”!